Iter 52: adaptation + detection precision sweep (20 iterations, 170 tests)#27
Open
abailey81 wants to merge 64 commits into
Open
Iter 52: adaptation + detection precision sweep (20 iterations, 170 tests)#27abailey81 wants to merge 64 commits into
abailey81 wants to merge 64 commits into
Conversation
…erate)
Before: _infer_direction used OR (IKI >= +20% OR edits >= +50%) to
label rising_load, but _keystroke_fired used AND for the same
direction. Mismatch meant single-signal shifts (clear IKI rise OR
clear edit spike but not both) silently dropped unless the embedding
magnitude independently crossed.
After: keystroke_fired now has two tiers.
* STRONG — both IKI >= +20% AND edits >= +50% (documented brief
trigger; unchanged).
* MODERATE — single-signal escalation: IKI >= +35% OR edits >= +120%.
Catches the cases where one channel dominates.
Tests: tests/test_shift_detector.py (new, 13 cases) covers warmup,
strong tier, both moderate tiers (the iter-1 cases), the
below-threshold negative case, falling_load AND rule, embedding-only
trigger, debounce, determinism, defensive coercion, end_session,
and LRU eviction. All 13 pass.
No regressions in adaptation/encoder/feature-vector/state-badge/
linguistic suites.
Before: deviation() and get_std() used the population variance estimator (m2 / n). At early-session counts (n=5–10) this is biased low, which under-estimates noise and inflates z-scores. After: both functions use the Bessel-corrected sample variance (m2 / (n - 1)). The deviation function additionally guards against n=1 (no defined sample variance). The downstream impact: deviation features in the 32-d feature vector (iki_deviation, length_deviation, vocab_deviation, etc.) now produce calibrated z-scores from the very first turn that clears the warmup gate. Affect-shift detection and the state classifier consume those features, so this lifts precision on both detectors without any other code changes. Tests: tests/test_baseline_tracker.py (new, 11 cases) cover Welford correctness vs statistics.fmean / .variance, Bessel-corrected z-score reference, warm-up gate, extreme-value clamping, degenerate-distribution guard, n=1 guard, reset, unknown-feature defensive default, and 5000-sample numerical stability. All 11 pass. 80/80 across affect/feature/adaptation/encoder/state-badge/ linguistic suites — no regressions.
…condary-state gap (0.15 -> 0.20)
Before: T=0.2 sharpened the softmax so aggressively that even on
borderline calm/focused inputs (cognitive_load right at the
0.4-0.65 boundary) the winner saturated near 0.95 and the runner-up
sat below 0.05. The secondary-state surfacing rule ("gap < 0.15")
almost never fired, so the badge UI reported a single state with
high confidence even when the underlying scores were genuinely
ambiguous.
After: T=0.35 + gap-threshold 0.20 (tuned together).
* Clean wins (cl=0.25 calm, cl=0.85 stressed) still land in the
0.6-0.8 confidence band — high enough to feel confident.
* Genuinely ambiguous inputs (calm/focused boundary, stressed/
distracted overlap) now land at 0.4-0.55 confidence with the
runner-up surfaced. The badge can show "calm/focused" rather
than feigning certainty.
* Same argmax determinism — no jitter introduced.
Tests: tests/test_state_classifier_calibration.py (new, 13 cases)
covers determinism, clean-win confidence band, both borderline
secondary-surfacing scenarios, [0,1] confidence bounds, warm-up
short-circuit, contributing-signals correctness, and defensive
handling of empty/NaN/nested adaptation dicts. All 13 pass.
93/93 across affect/feature/adaptation/encoder/state-badge/
linguistic suites — no regressions.
Before: topic_coherence used rounding-Jaccard at 0.1 resolution over
(type_token_ratio, formality, flesch_kincaid). A 0.05 shift across
all three features could cross every rounding boundary and collapse
coherence from 1.0 to 0.0. Discontinuous and visibly wrong on
trajectories.
After: cosine similarity over the same three-feature signature,
each feature centred at 0.5 first so the measure is direction of
deviation rather than raw magnitude. Mapped from [-1, 1] to [0, 1]
to keep the 0.0=different / 1.0=identical convention.
Properties:
* Smooth — small input changes produce small output changes.
* Bounded — always in [0, 1].
* Defensible edge cases — both-zero ⇒ 1.0 (no-signal identical
turns), one-zero ⇒ 0.5 (orthogonal in standard cosine sense).
Tests: tests/test_topic_coherence.py (new, 6 cases) covers range
([0,1] sanity), continuity under small perturbation, identical-
history high-coherence assertion, far-apart signatures producing
notably lower coherence, empty-history zero, determinism. All 6
pass. 99/99 across the adaptation/detection regression sweep.
…lative)
Drives the full pipeline (FeatureExtractor + BaselineTracker +
classify_user_state + AffectShiftDetector) turn-by-turn on synthetic
user trajectories and asserts the user-visible behaviour.
Scenarios (11 cases):
* calm baseline - 10 calm turns -> no shifts, no false stressed/
tired/distracted labels.
* rising load - calm 5 -> rushed 4 -> rising_load shift fires.
* falling load - stressed 5 -> recovery 4 -> falling_load shift fires.
* tired user - slow IKI + low engagement -> 'tired' label appears.
* distracted user - high IKI variance + normal mean -> 'distracted'.
* borderline - calm/focused boundary -> secondary state OR boundary
flip surfaces.
* topic coherence - consistent style stays high; style shift drops.
* debounce - sustained stress emits 1-2 suggestions, not many.
* baseline z-scores - calm probe stays bounded in [-1, 1].
* 50-turn smoke - random inputs run without exception, no NaN/inf.
This is the regression suite the user explicitly asked for - 'test
and emulate usage yourself'. It validates iter 1-4 cumulatively at
the end-to-end level. 110/110 pass across the adaptation/detection
test surface.
Before: _safe_embedding kept whatever flat dim the input came in with. _embedding_magnitude then called torch.stack on the per- session ring buffer, which raises RuntimeError on mixed shapes (e.g. a 32-dim warm-up turn followed by 64-dim steady-state turns). The exception fallback silently returned magnitude=0.0 — meaning a real shift could be missed during any embedding-shape transition. After: every input canonicalises to the canonical 64-dim shape inside _safe_embedding (zero-pad short, truncate long, flatten multi-dim first). The ring buffer is now always stack-compatible. torch.stack succeeds on every observation set, magnitude reflects real differences, no silent drops. Also tightened the .to() call to specify device='cpu' so a CUDA- resident input (e.g. from a batched encoder) is normalised before storage. Tests: 5 new cases in tests/test_shift_detector.py covering under-/over-sized embeddings, multi-dim embeddings, mixed-shape sequences, and CUDA-device portability. The mixed-shape test was strengthened to assert magnitude > 0.5 on a real shift (previously it was satisfied by the silent fallback); now passes only because of the canonicalisation. 115/115 across the regression sweep.
Documentation said: 'Baseline = the *first* min(N, len) observations
the detector saw in this session — anchors what does this user
normally look like?'
Implementation said: 'baseline_window = observations[:-recent_size]'
i.e. a rolling tail of the ring buffer. These disagreed.
The rolling-tail version has a precision gap: on a long session
where the user's pattern shifts and stays shifted, the rolling
baseline drifts toward the new normal. Eventually 'baseline_window'
contains all post-shift observations, recent_window is also post-
shift, the deltas zero out, and a sustained shift silently stops
being detected. The user's *original* normal is the right anchor.
Iter 7 implements the documented behaviour:
* New per-session _baselines: dict[key, list[_Observation]]
populated by the FIRST baseline_size observations of the
session. Once full, never changes for the lifetime of the
session.
* baseline_size defaults to max(2, window_size - recent_size)
so short-session callers see no behaviour change.
* Detection now compares recent window (rolling) vs the fixed
baseline. Sustained shifts produce stable, non-decaying signals.
* end_session and LRU eviction wipe the new dict alongside the
rolling buffer.
Tests: 2 new cases in tests/test_shift_detector.py.
* test_fixed_baseline_anchor_persists_across_long_session: 5 calm
+ 15 sustained-stress; rolling baseline would silently drop
detection by turn ~13; fixed baseline still fires with delta
>= +50% throughout.
* test_fixed_baseline_resets_on_end_session.
117/117 across the regression sweep — no other test relied on the
rolling-tail semantics.
Before: slope / abs(y_mean). At small y_mean, this blew up — a
slope of 0.01 with y_mean=0.001 produced 10, the downstream caller
clamped to 1.0, and a saturated trend signal bore no relation to
the actual change. Multiple low-magnitude features (length_trend,
latency_trend, vocab_trend) all pinned at ±1 hid real signal from
the TCN encoder and the smart router.
After: slope * (n - 1) = total change across the window. For
[0, 1]-bounded inputs this is naturally in [-1, 1] — full 0->1
rise returns +1.0, 1->0 fall returns -1.0, flat returns 0.0.
Properties:
* Continuous in slope (small change -> small change).
* Mean-invariant (a constant shift to all values doesn't change
the slope).
* Always finite.
* Returns the actual change for slow rises rather than saturating
at 1.0 from mean-division blow-up.
Tests: 11 cases in tests/test_normalised_slope.py covering trivial
cases (empty, single, flat), full-range rises and falls, the small-
y_mean stability fix (the iter 8 motivation), zero-y_mean with real
slope, continuity under small perturbations, determinism, and
robustness to large negative values + floating-point jitter.
128/128 across the regression sweep — no other test relied on the
old slope/y_mean semantics.
Before: AffectShift exposed magnitude (sigma units) and per-signal
deltas (signed %), but not a single normalised confidence number.
Downstream consumers (the chip UI, the explain_panel narration) had
to re-derive confidence from raw deltas to communicate trust.
After: AffectShift carries a 'confidence' float in [0, 1].
* 0.0 when not detected.
* [0.5, 1.0] when detected, where 0.5 = a tier just crossed,
1.0 = strong multi-tier corroboration.
Computation combines two evidence streams (max):
* Embedding evidence — ramp from 0 at magnitude_threshold to 1 at
3 x magnitude_threshold.
* Keystroke evidence — direction-specific ramps:
- rising_load: max(IKI ramp 20%->100%, edit ramp 50%->300%)
- falling_load: IKI ramp -15%->-60% (negative).
Tests: 5 new cases in tests/test_shift_detector.py.
* Undetected shifts have confidence == 0.0.
* Detected shifts have confidence in [0.5, 1.0].
* Strong shifts have higher confidence than weak shifts (apples-
to-apples comparison with full recent windows on both sides).
* to_dict() serialisation includes the field.
* Falling-load confidence scales with the IKI drop magnitude.
Purely additive change to the AffectShift dataclass — existing
consumers ignore unknown keys. 133/133 across the regression sweep.
Goes beyond the single-user scenarios in test_user_emulation.py by
simulating a realistic cohort of users with persistent typing
personalities across multiple sessions.
Five archetypes (each grounded in keystroke-dynamics literature):
* speed_typist — fast, smooth, few edits.
* thoughtful_writer — moderate pace, occasional edits.
* hunt_and_peck — slow, irregular, frequent corrections.
* multitasker — normal mean IKI, high variance.
* anxious_typist — slow + many edits + low engagement.
Five test cases:
* Per-user baselines track individual IKI (slow user's baseline
reads slow; fast user's reads fast — the BaselineTracker is
learning the right thing per-user with the iter-2 Bessel-
corrected variance).
* Cross-user state isolation (one user's high-load doesn't leak
into another's classifier output).
* Session-boundary reset (end_session followed by new session
returns the detector to warm-up — verifies iter-7 fixed-baseline
cleanup is correct).
* Cohort full run no NaNs (all 5 archetypes through 30 turns each;
no non-finite values leak into any feature, classifier output,
or shift result; confidence stays in [0, 1]).
* Archetype classification bias (each archetype produces an
expected dominant or co-occurring label).
138/138 across the regression sweep — every iter 1-10 commit holds.
Same precision improvement as iter 6's shift_detector fix, applied to the Identity Lock biometric authenticator. Before: _coerce_embedding flattened multi-dim inputs but didn't canonicalise the flat dim. A mismatched-shape embedding would silently produce cosine_sim=0 inside _score_match (via _safe_cosine's shape-mismatch guard), and the user appeared unrecognisable for that turn even when the underlying vectors agreed on the prefix dims. This is a precision bug for any session that mixes embedding sizes — e.g. a TCN warm-up turn at a transient dim followed by steady-state 64-dim turns. After: every input canonicalises to the canonical 64-dim shape (zero-pad short, truncate long, flatten multi-dim first). Cosine sees aligned shapes and returns the actual similarity. Tests: tests/test_keystroke_auth_robustness.py (new, 6 cases) covers under-/over-sized embeddings, multi-dim embeddings, mixed-shape sequences, None inputs, and NaN inputs. 153/153 across the regression sweep — every iter 1-11 invariant holds.
…icator
4 new cases covering the multi-user invariants of the Identity Lock:
* Separate users have independent templates: a single shared
KeystrokeAuthenticator instance keeps per-user state fully
isolated. Sending alice's embedding under bob's user_id
produces a noticeably lower similarity than under alice's
user_id.
* User template persists after other users' observations: a long
stream of observations from a different user must not drift
any given user's template.
* Force-register isolates per-user: stamping charlie's template
leaves alice and bob's templates intact.
* Reset-for-user only affects that user.
Closes a gap in the existing biometric coverage (test_biometric_match_
dataclass.py covered the dataclass shape only). 157/157 across the
regression sweep.
5 property tests using Hypothesis to fuzz random sequences and
inputs against the AffectShiftDetector / BaselineTracker /
classify_user_state invariants:
* Random sequences of (embedding, comp_ms, edits, pause_ms,
iki_mean, iki_std) tuples driven through observe() must never
raise, never produce non-finite outputs, never violate the
confidence convention (== 0.0 when not detected, in [0.5, 1.0]
when detected).
* Pathological embeddings (None, zero-length, NaN, inf) handled.
* BaselineTracker.deviation always lands in [-1, 1].
* BaselineTracker.get_std always finite, non-negative.
* State classifier always returns a valid state label and a
confidence in [0, 1] for any combination of inputs.
Hypothesis is already a dev dependency (6.152.2 in .venv). Each
property was fuzzed with up to 100 examples (~360 random cases
total) — no counter-examples found. This complements the
specific-scenario tests with broad-input invariant coverage.
162/162 across the regression sweep.
Before: _embedding_magnitude divided by max(sigma_baseline, 1e-3). On a session with very consistent embeddings (sigma close to 0) the divisor floored to 1e-3 and any tiny L2 distance produced a massive magnitude — e.g. a 0.005 perturbation against a zero baseline yielded magnitude=40, far over the 1.4-sigma threshold. False-positive shifts on stable users. After: _embedding_magnitude returns 0.0 when sigma_baseline is below 1e-2. At that variance level the baseline carries no useful information about normal embedding noise, so we let the keystroke channel be the sole detector (documented fallback). Tests: 1 new case covering the false-positive regression. No silent embedding fires on stable users; keystroke channel still fires correctly. 163/163 across the regression sweep.
The features._std helper was still using the population estimator (divide by n) after iter 2 fixed the same issue in BaselineTracker. At small sample sizes the population estimator under-estimates noise, which inflates time_deviation in the session-features extractor (it divides by this std). Switched to the Bessel-corrected sample estimator (divide by n - 1). Returns 0.0 for n < 2 (sample variance undefined for a single obs). 163/163 across the regression sweep.
…ate shift_detector
Three new property-based test cases:
* BaselineTracker handles unbounded inputs (outside [0, 1]) without
breaking — Welford's algorithm is range-agnostic, deviation
clamping is independent of input range.
* BaselineTracker remains numerically stable over long streams
(100-2000 updates) — no NaN/inf accumulates in mean or std.
* shift_detector at 12-turn steady-state on any constant input
never produces NaN/inf anywhere in its output (covers the worst
case for rolling + fixed-baseline buffer interaction).
166/166 across the regression sweep.
Drives all four core components end-to-end through a synthetic
100-turn session that transitions through 5 phases:
* 0-19: calm baseline
* 20-39: rising load (gradual)
* 40-59: sustained stress
* 60-79: recovery (faster-than-baseline typing)
* 80-99: second stress wave (high-variance shape)
Components exercised together at every turn:
FeatureExtractor.extract -> BaselineTracker.update ->
classify_user_state -> AffectShiftDetector.observe ->
KeystrokeAuthenticator.observe.
Per-turn invariants asserted:
* All 32 fv fields finite.
* State label in valid vocabulary; confidence in [0, 1].
* Shift confidence == 0 if not detected; in [0.5, 1.0] if detected.
* iki_delta_pct, edit_delta_pct, magnitude all finite.
* Authenticator state in valid vocabulary; confidence in [0, 1].
Post-run trajectory invariants:
* Calm phase produces no high-load labels.
* Rising-load phase fires a rising_load shift.
* Recovery phase fires a falling_load shift (proves iter-7
fixed-baseline + iter-1 tiered trigger work together correctly).
* Late stress phase produces high-load labels.
* Authenticator reaches registered/verifying mid-session.
* Per-user baseline mean stays bounded.
This is the gold-standard validation for iters 1-17 jointly.
167/167 across the regression sweep.
Two new property tests:
* Confidence is monotonically non-decreasing as IKI delta rises
(across IKI 150 -> 340 ms recent against a 120 ms baseline).
* Confidence is monotonically non-decreasing as edit count rises
(across edits 2 -> 12 against a 0-edit baseline).
Catches calibration regressions where a stronger signal would
accidentally produce weaker confidence — the kind of bug that
would erode trust in the routing chip. 169/169 across the
regression sweep.
Before: max(0.0, min(1.0, NaN)) returns NaN under IEEE 754 arithmetic. If an upstream feature extractor mis-fires and emits NaN, the clamping helpers propagate it into the feature vector, corrupting downstream classifier and shift-detector outputs. After: both helpers coerce non-finite inputs (NaN, +inf, -inf) to 0.0. Sane defaults that downstream consumers expect; NaN can no longer leak through the feature pipeline regardless of how the upstream extractor mis-behaves. 170/170 across the regression sweep.
11 snapshot tests with tight numerical tolerance. Each test
captures the exact post-iter-20 behaviour on deterministic inputs:
* BaselineTracker known z-score (iter 2 Bessel correction).
* features._std known value (iter 15).
* _normalised_slope known boundary values (iter 8).
* _clamp01 / _clamp_neg1_1 NaN/inf handling (iter 20).
* AffectShiftDetector canonical scenario (5 calm + 1 stressed)
with exact iki_delta_pct, edit_delta_pct, magnitude, confidence
range (iters 1, 7, 9).
* state_classifier clean calm + stressed labels with confidence
band assertions (iter 3).
* topic_coherence cosine similarity edge cases — identical,
anti-correlated, zero-zero, one-zero (iter 4).
If any future change shifts a numeric output beyond the tolerance,
this suite fires loudly — forcing a deliberate decision about
whether the change is intentional.
181/181 across the regression sweep (17 test suites).
The cosine helper added in iter 4 didn't guard against non-finite inputs. Even though iter 20's _clamp01 / _clamp_neg1_1 prevent NaN from reaching it via the normal flow, a future regression slipping a non-finite value past those clamps would have produced NaN coherence — which then propagates through the feature vector and corrupts downstream classifier / shift-detector outputs. After iter 22: any non-finite input returns 0.5 (the documented midpoint for 'no signal to compare'). Tests cover NaN, +inf, -inf, and mixed pathological/normal vectors. 182/182 across the regression sweep.
Iter 23 — complements the deterministic 100-turn integration test with Hypothesis-driven random sequences. Random tuples of (IKI mean, IKI std, composition_ms, edits, engagement) are generated as 5-40-turn sessions and fed through: FeatureExtractor + BaselineTracker -> classify_user_state -> AffectShiftDetector + KeystrokeAuthenticator. All output invariants asserted at every turn: * Every fv field finite. * State label in valid vocabulary; confidence in [0, 1]. * Shift confidence == 0 if not detected; in [0.5, 1.0] if detected. * Authenticator state in valid vocabulary; confidence in [0, 1]. Up to 30 generated cases x 5-40 turns each = ~600+ randomly-sampled end-to-end runs without finding a counter-example. Catches any multi-component interaction bug that single-component fuzzing misses. 183/183 across the regression sweep.
…gime) Confirms that when sigma_baseline >= the iter-14 floor, the embedding-magnitude trigger continues to fire normally. Prevents a future regression where the floor accidentally disables the embedding channel in normal-variance regimes.
Before: a 0-dim scalar tensor (e.g. torch.tensor(0.5)) had ndim=0, which the 'ndim > 1' guard didn't catch. It then fell through to torch.cat which raises 'zero-dimensional tensor cannot be concatenated'. An upstream caller sending a degenerate scalar crashed the affect-shift detector AND the keystroke authenticator. After: 'ndim != 1' triggers the flatten path for any non-1D input (0-dim scalars OR multi-dim tensors). The detector and the authenticator both handle scalar inputs by zero-padding to 64-dim. Tests: 1 case each in tests/test_shift_detector.py and tests/test_keystroke_auth_robustness.py. 186/186 across the regression sweep.
Pins the iter-7 fixed-baseline interaction with the iter-1 falling- load condition: 5-stressed baseline + 3-recovery turns must fire as falling_load with iki_delta <= -15% and edits flat-or-falling. Catches any regression that breaks the recovery-detection path. 187/187 across the regression sweep.
Four additional snapshot tests pinning the state_classifier
behaviour:
* Warming-up dominates first message regardless of keystroke
signals (the iter-3 calibration must not break this).
* Warming-up decays correctly once baseline is established.
* Borderline cognitive_load (0.45) + raised formality surfaces
calm/focused as primary+secondary.
* Pathological NaN/inf inputs produce a valid label with finite
confidence in [0, 1].
191/191 across the regression sweep.
Pins the iter-1 suggestion picker's determinism: the same (user_id, session_id, turn_index, direction) tuple always produces the same suggestion text — so the demo trace doesn't drift between rehearsals.
…and state_classifier
Iter 29 — when shift_detector reports a strong rising_load with
high confidence, state_classifier on the same recent parameters
should likewise weight the high-load candidates (stressed / tired /
distracted) — not return calm / focused.
A divergence between these two would surface to the user as a
contradictory state-badge + suggestion — the kind of bug that
erodes calibrated trust in the routing chip.
Four scenario tests:
* Strong rising_load shift -> high-load classifier label
* No shift -> low-load classifier label (calm / focused)
* Falling_load (recovery) -> calm / focused label
* Confidence floors aligned: shift >= 0.5 AND classifier >= 0.35
on the same scenario
196/196 across the regression sweep (18 test suites).
Iter 30 — genuine precision improvement to the AffectShift.confidence
calibration.
Before: confidence = max(emb_evidence, ks_evidence). When BOTH
channels fired strongly, confidence was identical to a single-channel
fire of the same maximum strength. That's miscalibrated: corroborating
evidence should produce higher trust than a lone signal.
After: confidence combines as max + 0.25 * min, clamped to [0, 1].
* Single-channel firings unchanged (the max baseline still
dominates) — preserves the iter-19 monotonicity invariants.
* Corroborating channels lift the score above either alone.
* For rising_load: combines IKI ramp + edit ramp with the bonus.
* For falling_load: combines IKI-drop ramp + edit-DECREASE ramp
(an edit drop is corroborating evidence of recovery, beyond the
falling-load trigger's flat-or-falling minimum requirement).
* Across embedding + keystroke channels: same bonus when both fire.
Tests:
* Corroborated IKI + edit signals produce strictly higher confidence
than single-channel of the same IKI delta.
* Falling_load with sharp edit-drop produces strictly higher
confidence than falling_load with flat edits.
* iter-19 monotonicity tests still pass (single-axis sweeps see
no change since min(0, x) = 0).
198/198 across the regression sweep.
Iter 35 — second visible-adaptation precision win. Before: StyleMirrorAdapter computed verbosity = message_length / 0.7. Since FeatureExtractor normalises message_length against a 500-word ceiling, real chat (5–50 words) produced message_length 0.01–0.10 which mapped to verbosity 0.014–0.143. Every chat-sized message read as 'very low verbosity' and the post-processor's hedge-strip path always fired regardless of how the user actually typed. The adapter never adapted. After: verbosity = message_length / 0.10. Calibrated for chat-sized messages: ~5-word msg -> verbosity ~0.30 (terse → strip hedges) ~25-word msg -> verbosity ~0.40 (default → no rewrite) ~50-word msg -> verbosity ~0.69 (verbose → append follow-up) ~100-word msg -> verbosity ~0.84 (saturated) The post-processor now produces visibly different reply shapes based on how much the user typed. Tests: pinned the chat-sized verbosity calibration in tests/test_cognitive_load_dynamic_range.py (5-word stays sub-strip- threshold, 25-word at boundary, 50-word at follow-up threshold, monotonic in message length). 207/207 across the regression sweep.
…ow reachable Iter 36 — third visible-adaptation precision win. Before: directness = question_ratio*0.3 + (1-question_ratio)*0.7 mapped to [0.3, 0.7] — exactly half of [0, 1]. But the cloud prompt-builder gates 'be more direct' on directness > 0.7 (strict inequality), so that path was DEAD CODE — pure statements (the case where the instruction would fire) maxed at exactly 0.7. After: directness = 0.85 - 0.7 * question_ratio maps to [0.15, 0.85]. question_ratio=0 -> 0.735 (statements, > 0.7 -> be direct fires) question_ratio=0.5 -> 0.500 (default) question_ratio=1 -> 0.230 (questions, < 0.3 -> be inquisitive fires) The post-processor's directness instructions actually become reachable for the first time. 207/207 across the regression sweep.
…to reply shaping
Iter 37 — the user complaint 'the answers must actually be shaped'.
Before: post-processor adapted on 4 axes (cognitive_load, formality,
verbosity, accessibility). Three axes (directness, emotional_tone,
emotionality) were computed by the AdaptationController but
*completely ignored* by the response post-processor. After iter
34-36 fixed the input-side dynamic range, the output-side still
saw only half the adaptation signal.
After: all 7 axes shape the reply. Three new transforms, all
subtraction-only (respecting the 'never generate new content'
design rule):
* directness > 0.7 -> strip soft openers ('you might want to
consider', 'perhaps you could', 'feel free to'). When the user
types declaratively, replies become assertive.
* emotional_tone > 0.7 -> strip warm interjections ('Sure!',
'Of course!', 'Happy to help!') and collapse exclamation
points to periods. When the user wants neutral / objective
tone, replies become clinical.
* style_mirror.emotionality < 0.3 -> strip emotive intensifiers
('absolutely', 'incredibly', 'amazingly'). When the user is
measured / dispassionate, replies stop sounding hyperbolic.
Plus a sentence-capitalisation pass at the end that fixes the
case where stripping leaves a lowercase word at the start of a
sentence.
Tests: tests/test_response_shaping.py (new, 23 cases)
* Per-axis: 8 cases proving each axis (now all 7) actually shapes
the reply.
* End-to-end: 5 user-state scenarios proving identical raw
replies produce visibly different post-processed text.
* Invariants: determinism, capitalisation, non-emptiness, log
shape.
* Spread: 81-state combinatorial sweep producing >= 8 distinct
outputs.
230/230 across 20 regression suites.
…+ live-UI fixes
Iter 38 — typing rhythm now feeds cognitive_load. A stressed user
typing a short message ("ugh just tell me") used to get the same
cognitive_load as a calm user typing the same content because the
adapter only looked at message-content complexity. Now editing_effort,
backspace_ratio, and positive iki_deviation contribute via a max() of
rhythm signals (mean diluted real signals when iki_deviation collapsed
to 0 on degenerate baselines), with a +0.20 boost above the 0.20
threshold so a 4-edit user crosses from the 4-sentence tier into the
2-sentence tier.
Iter 40 — StyleMirror smoothing rate 0.2 -> 0.35 so that consistent
declarative messages cross the directness > 0.7 threshold within 2
turns instead of 4. In the 60-user emulation this drove directness
firing from 13/60 to 36/60.
Iter 41 — fixed three live-UI bugs surfaced by actual usage:
* server expected ``edit_count`` but the JS client (KeystrokeMonitor)
sends ``backspace_count``. Server now accepts both.
* server expected ``inter_key_interval_ms`` on keystroke events but
the JS client sends ``iki_ms``. Without the fallback, every
keystroke buffered a zero IKI and the dashboard's "Typing rhythm"
tile read 0 ms on every turn.
* counterfactual sensitivity output reported "formality would have
been 1.089" — linearly extrapolated past the [0, 1] envelope.
Clamped both feature and dimension values.
60-user emulation results:
* 100% of synthetic users get visible shaping (>=1 axis fired)
* cognitive_load mean spread by archetype: calm 0.46, verbose 0.75,
stressed 0.89, tired 0.87
* reply length spread: 20–275 chars, std 98
* 1/12 precision miss (long_q saturates cl from content alone)
Iter 42 — when a calm user and a rhythm-stressed user both type the same complex content, content-driven cognitive_load lands them in the same 0.6-0.8 tier (2 sentences) and the post-processor produces identical shaped replies. The 60-user emulation harness flagged this as the lone precision miss out of 12 message types. Added a 0.55-0.65 -> 3 sentence band so that calm users typing verbose content get 3 sentences while rhythm-stressed users typing the same content get 2. Pinned with a regression test: ``test_complex_message_differentiates_calm_from_stressed``. After this change the emulation reports 0/12 precision misses and directness firing rises from 36/60 to 40/60 (the 3-sentence cap preserves enough content for the directness rewrite to stay relevant on the trimmed reply).
… 0.5 Iter 43 — pre-fix the formula was ``tone = 0.5 - distress*0.5``, so ``emotional_tone`` could never exceed 0.5 in practice. That made the post-processor's ``emotional_tone > 0.7`` warmth-stripping branch dead code: a third dead-code path after iter 36 (directness) and iter 37 (post-processor wiring). A user with strong positive sentiment is asking the system not to hand-hold — they're fine and want the answer, not the comfort. ``neutrality_score = max(0, sentiment_valence)`` is now an additive drive that pushes tone above 0.5 toward 1.0, finally reaching the warmth-strip threshold. Deliberately do NOT use ``features.formality`` for the neutrality drive: the formality scorer is purely subtractive (1.0 minus the rate of contractions and slang) so plain chat without explicit informal markers reads at maximum 1.0. Wiring that into the tone formula made the warmth-strip path fire on every neutral chat message — the exact false-positive iter 36/37 went out of its way to avoid. After the fix the 60-user emulation maintains 0/12 precision misses, 8 distinct shaped replies, and 100% visible-shaping coverage.
Iter 44 — the formality scorer was purely subtractive (``1.0 - informal_rate``) so plain chat without explicit slang/ contractions read as 1.0 (max formal). Every casual message ranked the same as a legal brief. This poisoned every downstream consumer of features.formality, including the StyleMirror's formality adaptation and the iter-43 emotional_tone neutrality drive. The new score combines three balanced signals around a 0.5 baseline: * informal_rate (contractions + slang) -> push down * formal_rate (new FORMAL_MARKERS set) -> push up * long_word_rate (proxy for register) -> gentle upward boost Validation samples after the fix: 0.500 'how does this work?' 0.500 'ok do it' 0.200 'yo whats up bro lol' 0.250 'gonna grab lunch idk' 0.593 'Greetings; I trust this finds you well.' 0.719 'Pursuant to your inquiry I would like to inform you regarding the matter.' 0.825 'Therefore, the consequences of inaction are demonstrably significant; we must accordingly proceed.' EmotionalToneAdapter (iter 43) now folds formality back into the neutrality drive — the iter-43 commit had to drop it because the broken scorer made every plain message false-positive on the warmth-strip path. The threshold is formality > 0.6 so genuinely formal text contributes; calibrated chat at 0.5 doesn't.
…softener strip Iter 45 — the directness softener regex used to strip 'You might want to consider' but leave the 'that' that introduced the consideration: before: 'You might want to consider that approximately five...' after: 'That approximately five...' <- grammatically broken Extended the regex to absorb a trailing '(that)? (perhaps)?' so the strip leaves a clean clause: after: 'Approximately five...' <- clean The 60-user emulation now produces 8 distinct shaped replies whose text is grammatically clean, not just visibly different.
…fficulty Iter 47 + iter 48 — the accessibility path was dead in practice: * threshold was 0.7 over a 4-signal mean, so all four signals needed to be near-maximum simultaneously * mean() diluted the score whenever ``iki_deviation`` / ``speed_deviation`` collapsed to 0 (which happens any time the user's prior turns were too uniform to define a meaningful std) Result: an editing_effort of 0.80 plus a backspace_ratio of 0.33 averaged to 0.28 — clearly motor-impaired typing was scored at zero accessibility. Iter 47: detection_threshold 0.7 -> 0.5. Iter 48: switched the aggregator from mean() to max() (same fix pattern as iter 38 for cognitive_load rhythm signals — any single strong signal is sufficient evidence; averaging dilutes baseline-degenerate channels). Verified: a normal user with 4 calm priming turns now sees accessibility=0.800 the moment they hit a true motor-difficulty turn (editing_effort=0.80, backspace=0.33), so the post-processor's vocabulary simplification finally activates when the user actually needs it.
The keystroke authenticator's ``z_comp`` divisor was
``template_comp_mean * 0.3`` — tighter than real per-user
composition-time variance. A legitimate owner who typed a 12-second
message after registering on 3-second messages got ``z_comp ≈ 11``
clipped to 5σ — and the biometric panel flagged "composition cadence
(5.0 sigma off)" on his OWN typing. The user reported this directly:
"Typing pattern diverges from registered owner".
Composition time scales with message length, so single-user variance
is realistically 50% of mean plus an absolute baseline. Widened the
divisor to ``max(template_comp_mean * 0.5, 2000.0)``:
before: 9890 ms diff / max(900, 1) = 10.99 -> 5σ (false positive)
after: 9890 ms diff / max(1500, 2000) = 4.95 -> 4.95σ (still flags
truly different
typists)
Truly impostor typists with 8+ σ deltas are still detected (clipped
at 5σ); only the false-positive band on legitimate-owner long
messages closes.
…-44 formality fix The iter-44 formality calibration changed the value pure-chat text returns from 1.0 (broken) to ~0.5 (neutral). The focused state's 'raised formality' signal keys off formality > 0.55, so chat-only users now fall cleanly into 'calm' without a focused secondary. Updated the borderline test to use slightly formal text (with three formal markers — therefore / regarding / indeed) so it actually sits on the calm/focused boundary rather than firmly inside 'calm'. The test's intent is unchanged — it still pins that the state classifier surfaces secondary labels on borderline cases — only the input was recalibrated for the post-iter-44 formality scale.
Iter 51 — the verbosity hedge stripper produced grammatically broken
output when the hedge was parenthetical:
before: 'Uzbekistan is, perhaps, a landlocked country.'
after: 'Uzbekistan is, a landlocked country.' <- dangling comma
Extended the regex with an optional leading ``(?:,\s+)?`` so the
parenthetical comma is absorbed alongside the hedge, and used a
match-aware replacer that puts back a single space when the leading
comma was present (so 'X, perhaps, Y' joins to 'X Y', not 'XY').
The leading-comma group requires an actual comma (not just leading
whitespace) so it doesn't gobble inter-word spaces in non-
parenthetical positions ('It actually borders' -> ' borders', not
'Itborders').
Validation:
In: 'I think Uzbekistan is, perhaps, a landlocked country.'
Out: 'Uzbekistan is a landlocked country.'
In: 'just go ahead.'
Out: 'Go ahead.'
In: 'You really should consider it.'
Out: 'You should consider it.'
All 46 post-processor + shaping tests still green.
…iter 53) The prompt-builder treated high cognitive_load as 'user has spare capacity, give richer detail' — per the now-stale CognitiveLoadAdapter docstring. But iter 38 (rhythm-driven cl) made high cl explicitly mean 'user is stressed', and the post-processor's length tiering trims high-cl users to 1-2 sentences. So pre-iter-53 the LLM was instructed to produce 'sophisticated vocabulary and detailed explanations' for stressed users, then the post-processor immediately trimmed that detailed prose to a single sentence — wasted tokens and inverted user-state semantics. Aligned the prompt-builder's tiers to the post-processor's: cl >= 0.8: 'reply in a single concise sentence' (matches 1-sentence trim) cl >= 0.6: 'tight, <= 2 sentences, lead with the answer' (matches 2-sentence trim) cl >= 0.4: 'moderate complexity' cl < 0.4: 'richer vocab, 4-6 sentences when warranted' (matches 4-6 sentence cap) Updated CognitiveLoadAdapter's return-doc to document the unified semantic so future contributors don't re-introduce the inversion.
…cl tier
Iter 54 — the cloud prompt-builder asks the LLM to 'keep responses
extremely short (under 15 words)' when accessibility > 0.5, but the
LLM doesn't always comply. The post-processor's length tier was
keyed only on cognitive_load — so an accessibility user with
moderate cl (0.4) got the 4-sentence tier and the LLM's 50-word
reply went through untouched.
When accessibility > 0.5 we now lift effective_cl to >= 0.85 (the
1-sentence tier), so:
before: accessibility=0.65, cl=0.4
-> 'Sure! Here is a detailed response. The first thing to
note is that this is complex. Furthermore, you should
consider the implications.' (134 chars)
after: accessibility=0.65, cl=0.4
-> 'Sure!' (5 chars)
Normal users (accessibility=0) are unaffected — same cl=0.4 still
gives the 4-sentence tier.
…sor (iter 55) Iter 55 — the prompt-builder asked for verbosity changes at < 0.3 / > 0.7 while the post-processor stripped at < 0.35 / > 0.7. The [0.30, 0.35] band let the LLM hedge while the post-processor silently stripped — wasted tokens. Same gap on formality (< 0.3 vs < 0.35; > 0.7 vs > 0.65). Aligned both thresholds to the post-processor's, and reworded the prompts so the LLM proactively skips hedges + follow-ups (verbosity low) or appends a follow-up (verbosity high) — matching what the post-processor does anyway. The cloud and post-processor now agree on the shape of every axis at every threshold.
…semantics 9 new regression tests covering: * Accessibility fires from any single strong difficulty signal (iter 48 max() aggregator) * Mild stress signals must NOT fire accessibility (iter 47 0.5 floor) * High cl prompt asks for brief reply, NOT 'sophisticated' (iter 53) * Low cl prompt allows richer 4-6 sentence depth (iter 53) * Accessibility forces <= 1 sentence at moderate cl (iter 54) * Normal users at same cl keep multiple sentences (iter 54) * Prompt-builder verbosity threshold matches post-processor's 0.35 (iter 55) * Prompt-builder formality threshold matches post-processor's 0.65 (iter 55) These tests fire immediately if a future refactor re-introduces a semantic inversion or threshold mismatch across the adaptation axes.
Iter 58 — the user reported the dashboard's 'Typing rhythm' tile still reading 0 ms after iter 41 + restart, even though composition cadence (1.29 s avg) was being recorded correctly on the same turn. Root cause: the iter 41 fix correctly routed JS-format ``iki_ms`` into the keystroke_buffer, but the buffer's first sampled keystroke event always has ``iki_ms = 0`` (no preceding keystroke). When a short message produced exactly ONE keystroke event (every 3rd keystroke is sampled — so a 3-key message lands on event #3 with keyTimings.length=2 → iki_ms is the gap between #2 and #3, OK; but when the very first sampled event is the only one, the buffer has [0]). The pre-fix server passed ``[0]`` straight to ``Pipeline._iki_stats``, which filtered the zero out and returned mean=0 — even though composition_metrics.keystroke_timings had real data right there. Three-level fallback: 1. server-side keystroke_buffer (filter to non-zero entries) 2. composition_metrics.keystroke_timings array 3. composition_metrics.mean_iki scalar (last resort: synthesize a single timing from the JS-precomputed mean) The dashboard's 'Typing rhythm' tile now never reads 0 ms when the JS client has any meaningful inter-key data — even on edge-case messages where the per-event sampler captured only zero-IKI entries.
5 new regression tests covering: * buffer-only-zeros falls through to comp_metrics.keystroke_timings * buffer non-empty (with non-zero entries) takes priority * mean_iki scalar is the last-resort fallback * truly-empty input correctly returns empty list * zero mean_iki does not pollute pipeline with [0.0] These pin the iter-58 logic so a future refactor of the keystroke-timing extraction can't silently re-introduce the 'Typing rhythm 0 ms' regression Tamer reported. Total websocket key-compat regression tests: 12. Total adaptation-related test suite: 217 / 217 PASS.
…ns (iter 60) 4 new regression tests covering: * Different archetypes produce distinct cognitive_load means (>= 0.30 spread across 5 users) * Archetype ordering is intuitive: fast < anxious < stressed * Within-user cl smoothing remains stable when other users interleave their turns (pstdev < 0.20 per user) * At least 3 of 5 users get distinct mean reply lengths These pin the per-user-state-isolation invariant — if a future refactor accidentally shares baseline / controller / extractor state across users, these tests fire immediately.
…er 61) Drives 100 deterministic-pseudo-random pathological inputs through the full FeatureExtractor + BaselineTracker + AdaptationController + ResponsePostProcessor stack and asserts: * The pipeline never raises * Every adaptation axis is a finite float in [0, 1] * Every shaped reply is non-empty (post-processor empty-fallback works) Pathological inputs include: empty / whitespace-only messages, single-char messages, 500-char repeated chars, emoji-heavy text, slang-stacked text, formal text, hedge-stacked text, multi-line content, punctuation chaos, 100-word repeated phrases, extreme IKI values (0, 1ms, 5000ms), absurd composition times (50ms to 10min), edit counts up to 100. Full 1000-iteration version lives at D:/tmp/chaos_fuzz_emulation.py. Adaptation regression suite now: 321 / 321 PASS across 21 test files.
…iter 62) User reported only seeing 2 states (calm/focused) in a real session, which was correct for their typing pattern but raised the question whether the other 4 states are reachable at all. 6 parametrized tests, one per state, with crafted signal patterns: * warming up - first message, baseline not established * calm - normal IKI, low load, no edits * focused - cognitive sweet spot 0.55 + raised formality * stressed - high load + edits + IKI variance + long composition * tired - long composition + slow IKI + low engagement * distracted - high IKI variance + normal mean + intermittent edits If a future calibration tweak makes any state unreachable, the corresponding test fires. Adaptation regression suite: 327 / 327 PASS across 22 test files.
The user reported '12 affect-shift events out of 16 messages' (75% trigger rate). The calibration emulation revealed the real issue: on a STABLE 20-turn session with no real shift, the detector fired on 80% of turns. Root cause: the magnitude formula was ``L2(diff) / σ_per_dim``. The L2 norm scales with sqrt(N) — 64 dims × per-dim σ of 0.05 produced L2(diff) ≈ 0.4 between successive baseline windows even though nothing meaningful had changed. Magnitude = 0.4 / 0.05 = 8σ — well above the 1.4σ threshold. Fix: divide L2 by sqrt(N) too, so the magnitude is *RMS per-dim deviation* in σ-units, not *total L2 norm*. Calibration emulation results before/after iter 63: scenario before after -------------------- ------ ----- stable_session 80% 0% (was firing on within-session noise) clear_shift_session 80% 50% (still detects real stress shift) gradual_shift 80% 55% (still detects gradual escalation) oscillating 80% 0% (alternation now correctly cancels) high_noise 80% 0% (sigma floor + RMS rejects pure noise) Adjusted one regression test threshold (test_mixed_shape_embeddings) from > 0.5 to > 0.3 to match the post-iter-63 RMS-based magnitudes.
Tamer reports 'Typing rhythm 0 ms' and 'Edit profile 0/turn' persist even after iter 41 + iter 58 + server restart. Composition cadence shows real values (1.29 s) on the same dashboard, so the message path IS reaching the profile aggregator — but iki_mean and edit_count are both reading zero in the snapshot. Added a single-line INFO log per message turn that captures: * which composition_metrics keys arrived from the client * the raw composition_ms / edit_count after extraction * how many entries each fallback level produced (buffer / comp_timings / scalar) * what the final keystroke_timings list looks like Now Tamer can grep the server log for '[ws-keystroke-trace]' after typing a few messages and immediately see whether: * the JS payload is missing fields (then this is a JS-side bug) * extraction is producing zeros (then iter 41/58 fixes need adjustment) * the pipeline / aggregator is the problem (next layer to instrument)
Pairs with the iter-64 websocket trace. If '[ws-keystroke-trace]' shows real keystroke_timings being routed but '[profile-update-trace]' shows iki_mean=0, the bug is in _iki_stats's zero-filter or the profile aggregator's accumulator. Together the two log lines pin exactly where the live dashboard's 'Typing rhythm 0 ms' regression originates.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Originally a 33-iteration precision sweep over correctness + numerical robustness; iters 34-36 now also fix three user-visible adaptation bugs that explained why the system "didn't feel adaptive when typing".
What changed (38 commits, 209+ tests, 19 suites)
Iters 34-36 — visible adaptation fixes (the user-experience layer)
56288b5mean_word_length / 10andflesch_kincaid / 20were re-dividing already-normalised features. cognitive_load saturated around 0.6 even on the most complex inputs. After fix: full [0.10, 0.90] range. The post-processor's reply-length tiering now actually responds to user complexity.01d86f2message_length / 0.7was tuned for 350-word essays. Real chat (5-30 words) produced verbosity 0.014-0.143; the post-processor's hedge-strip path always fired regardless of how much the user typed. After fix (/ 0.10): tiny msgs → strip; mid msgs → no rewrite; long msgs → append follow-up.c38b409q*0.3 + (1-q)*0.7capped directness at [0.3, 0.7]. The cloud prompt-builder gated "be more direct" ondirectness > 0.7(strict) — DEAD CODE. After fix: directness reaches [0.15, 0.85] so both directness instructions actually fire.Iters 1-33 — correctness + numerical robustness
(See CHANGELOG
[2026-04-28] Iter 52for the full list.)Code fixes (15): tiered keystroke trigger; Bessel-corrected variance × 3; softmax calibration; cosine topic_coherence; embedding canonicalisation × 2; fixed-baseline anchor; slope stability; confidence field + corroboration bonus; min-σ floor; gradated zero-baseline pct; NaN-safe clamps × 2; scalar-tensor handling; NaN-quarantine in BaselineTracker.
Test additions (12): single-user emulation, 5-archetype cohort emulation, keystroke isolation, Hypothesis property fuzzing × 2, integration × 2, snapshot × 4, monotonicity, boundary, cross-component consistency.
Test plan
🤖 Generated with Claude Code